Intro to Polars

Joel Herndon

Duke University Libraries
Center for Data and Visualization Sciences

February 4, 2026

Topics covered


01. What is polars?

02. Why polars?

03. How do I import data?

04. How can I explore my data?

05. How can I transform my data?

01.
What is polars?

What is Polars?

  • High performance data processing library
  • Works in Rust, R, and Python (plus more languages…)

02.
Why polars?

Why polars?

  • Speed
  • Expressiveness
  • Strictness (consistent code)

“Come for the speed, stay for the API”
         - Janssens and Nieuwdorp, 2025

Polars in action

1. Dashboards

2. Charts

3. Summary tables

03.
How do I import data?

How do I import data?

  • Today, we will focus on loading a file of comma separated values (csv) to create a DataFrame
  • Note that Polars can also read data from Excel and many other formats.

Polars DataFrames


  • each column can hold a different data type
  • each column must contain the same number of elements
  • columns are labeled with unique names

Importing csv files

The read_csv() function creates a Polars DataFrame from a csv file.

import polars as pl
flights = pl.read_csv("data/flights.csv")
flights

Importing csv files

shape: (6, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to New York" 420.0 693 true

CSV Save

The write_csv() function creates a csv file to preserve the data.

import polars as pl
flights.write_csv("data/saved_flights.csv")

04.
How can I explore my data?

Exploring DataFrames


01. Evaluating data size

Evaluating size

Polars includes the shape of the DataFrame in its default output:

(shape: (6, 4))

flights
shape: (6, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to New York" 420.0 693 true

Evaluating size

You can also check the properties of a DataFrame programmatically:

height for a count of rows

flights.height
6

width for a count of columns

flights.width
4

shape for height and width (rows, columns)

flights.shape
(6, 4)

Exploring DataFrames


01. Evaluating size

02. Browsing

Browsing

head() lists the first five rows of data

flights.head()
shape: (5, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false

Browsing

tail() lists the final five rows of data

flights.tail()
shape: (5, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to New York" 420.0 693 true

Browsing

By default, Polars prints only the first and last 5 rows of a long DataFrame.
If you want to see more rows:

pl.Config.set_tbl_rows(100) # increase the default rows printed to 100  
polars.config.Config

Exploring


01. Evaluating data size

02. Browsing DataFrames

03. Understanding columns

Understanding columns

columns lists the column names

flights.columns
['flight', 'cost', 'distance_km', 'non_stop']


schema lists column names and data types

flights.schema
Schema([('flight', String),
        ('cost', Float64),
        ('distance_km', Int64),
        ('non_stop', Boolean)])

Exploring DataFrames


01. Evaluating data size

02. Browsing

03. Understanding columns

04. Considering descriptive statistics

Descriptive statistics

Often, we want to assess the associated statistics for each column.

describe() provides descriptive statistics for DataFrames

flights.describe()
shape: (9, 5)
statistic flight cost distance_km non_stop
str str f64 f64 f64
"count" "6" 5.0 6.0 6.0
"null_count" "0" 1.0 0.0 0.0
"mean" null 554.0 3123.0 0.666667
"std" null 385.460763 2612.571989 null
"min" "Raleigh Durham to Boston" 250.0 693.0 0.0
"25%" null 300.0 985.0 null
"50%" null 420.0 3595.0 null
"75%" null 600.0 6209.0 null
"max" "Raleigh Durham to Svalbard" 1200.0 6216.0 1.0

Exploring data


01. Evaluating data size

02. Browsing

03. Understanding columns

04. Considering descriptive statistics

05. Handling missing data

Missing data

  • Polars represents missing data as null.
  • Polars excludes missing data from calculations
flights.head(1)
shape: (1, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false

05.
How can I transform my data?

Five essential data wrangling verbs

Task                   polars “verb”
Subset columns         select()
Subset rows            filter()
Sort                   sort()
Create a new variable  with_columns()
Aggregate by groups    group_by()

5.1. Subset columns with select()


  • select() chooses columns
  • The flights DataFrame has four columns.
  • Let’s select flight and cost

column       description
flight       destination of RDU flight
cost         cost of flight
distance_km  distance in km
non_stop     is flight non-stop

5.1. Subset columns with select()

flights.select(pl.col('flight','cost'))
shape: (6, 2)
flight cost
str f64
"Raleigh Durham to Svalbard" null
"Raleigh Durham to London" 1200.0
"Raleigh Durham to Los Angeles" 600.0
"Raleigh Durham to Boston" 250.0
"Raleigh Durham to Chicago" 300.0
"Raleigh Durham to New York" 420.0

Note that the columns are returned in the order that I request them.

5.1. Subset columns with select()

When you only need to remove a column or two, the drop() method is often more convenient than listing every column to keep in select().

flights.drop(pl.col('cost'))
shape: (6, 3)
flight distance_km non_stop
str i64 bool
"Raleigh Durham to Svalbard" 6209 false
"Raleigh Durham to London" 6216 true
"Raleigh Durham to Los Angeles" 3595 true
"Raleigh Durham to Boston" 985 true
"Raleigh Durham to Chicago" 1040 false
"Raleigh Durham to New York" 693 true

5.1. Subset columns with select()

  • select() reduces the DataFrame to the essential columns and improves performance.

  • Polars offers numerous options for more precise selections including:

    • Regular Expressions: flights.select(pl.col('^c.*$')) # columns starting with “c”
    • Select by data type: flights.select(pl.col(pl.Int64))
    • Slicing (not recommended, but possible!): flights[:]

5.2 Subset rows with filter()

The filter() method keeps only the rows that you wish to include in a DataFrame.

flights.filter(pl.col('non_stop'))
shape: (4, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to New York" 420.0 693 true

5.2 Subset rows with filter()

Let’s say that we wanted only the flights from RDU that were:

  • a distance less than 1000 kilometers
flights.filter(pl.col('distance_km') < 1000)
shape: (2, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to New York" 420.0 693 true

So, our Boston and New York flights are both under 1000 kilometers from Durham.

5.2 Subset rows with filter()

Polars also has convenient tools for finding text:

  • Is there a flight to Svalbard?
flights.filter(pl.col('flight').str.contains("Svalbard"))
shape: (1, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false

So, yes: there is one flight to Svalbard in our data.

5.2 Subset rows with filter()

Polars allows complex filter queries by combining multiple conditions using boolean expressions. For most operations, you will use the following operators to build these expressions.

Operator  Symbol
AND       &
OR        |
NOT       ~

5.2 Subset rows with filter()

Let’s say that we want to see all the flights that are:

  • less than 6000 kilometers from Durham
  • less than $500
flights.filter(
    (pl.col('distance_km')<6000) &
    (pl.col('cost')< 500.00)
)
shape: (3, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to New York" 420.0 693 true

5.2 Subset rows with filter()

Final thoughts on filter-ing:

  1. Polars requires explicit conditions
  2. When you combine conditions with & or |, Polars requires you to enclose each condition in parentheses.
  3. Using software that highlights parentheses can be helpful.

5.3 sort()

sort() allows us to specify how the rows in the DataFrame should be ordered.

5.3 sort()

Let’s try a simple sort based on our filtered data.

  • We will sort by distance for non-stop flights.
flights.filter(pl.col('non_stop')).sort(pl.col('distance_km'))
shape: (4, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to New York" 420.0 693 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to London" 1200.0 6216 true

5.3 sort()

In the above, note how sort defaults to ascending order.

  • If you want to sort in descending order:
flights.filter(pl.col('non_stop')).sort(pl.col('distance_km'), descending=True)
shape: (4, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to New York" 420.0 693 true

5.3 sort()

It is entirely possible to specify more than one column for a sort.

  • Let’s look at our non-stop flights sorted by cost (descending) and then by distance (ascending).

(
flights
    .filter(pl.col('non_stop'))
    .sort(
        [
          pl.col('cost'),
          pl.col('distance_km')
        ],
        descending=[True, False]          
    )
)
shape: (4, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to New York" 420.0 693 true
"Raleigh Durham to Boston" 250.0 985 true

5.3 sort()

If you have missing data in a sort column, it will be listed first.

  • Show null values last by adding the nulls_last=True parameter:
flights.sort(pl.col('cost'), descending=True, nulls_last=True)
shape: (6, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to New York" 420.0 693 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Svalbard" null 6209 false

5.4 with_columns()

with_columns() provides a convenient way to add columns to DataFrames.

  • First, we assign a column name using “keyword argument syntax” (cost_per_kilometer)
  • Then we calculate the cost per kilometer
flights.with_columns(
   cost_per_kilometer = pl.col('cost') / pl.col('distance_km')
)
shape: (6, 5)
flight cost distance_km non_stop cost_per_kilometer
str f64 i64 bool f64
"Raleigh Durham to Svalbard" null 6209 false null
"Raleigh Durham to London" 1200.0 6216 true 0.19305
"Raleigh Durham to Los Angeles" 600.0 3595 true 0.166898
"Raleigh Durham to Boston" 250.0 985 true 0.253807
"Raleigh Durham to Chicago" 300.0 1040 false 0.288462
"Raleigh Durham to New York" 420.0 693 true 0.606061

5.4 with_columns()

Polars only makes changes to DataFrames permanent when you save the change.

  • Even though I calculated “cost_per_kilometer” on the last slide…
  • The column was not added to the DataFrame (any guesses as to why?)
flights
shape: (6, 4)
flight cost distance_km non_stop
str f64 i64 bool
"Raleigh Durham to Svalbard" null 6209 false
"Raleigh Durham to London" 1200.0 6216 true
"Raleigh Durham to Los Angeles" 600.0 3595 true
"Raleigh Durham to Boston" 250.0 985 true
"Raleigh Durham to Chicago" 300.0 1040 false
"Raleigh Durham to New York" 420.0 693 true

5.4 with_columns()

Polars only makes changes to DataFrames permanent when you save the change.

  • Let’s save the cost_per_kilometer permanently to our DataFrame.
  • Note that this example assigns the output to flights:
flights = flights.with_columns(
   cost_per_kilometer = pl.col('cost') / pl.col('distance_km')
)
flights
shape: (6, 5)
flight cost distance_km non_stop cost_per_kilometer
str f64 i64 bool f64
"Raleigh Durham to Svalbard" null 6209 false null
"Raleigh Durham to London" 1200.0 6216 true 0.19305
"Raleigh Durham to Los Angeles" 600.0 3595 true 0.166898
"Raleigh Durham to Boston" 250.0 985 true 0.253807
"Raleigh Durham to Chicago" 300.0 1040 false 0.288462
"Raleigh Durham to New York" 420.0 693 true 0.606061

5.4 with_columns()

Polars offers a second syntax style for creating columns.

  • Expression based syntax assigns the column name using the .alias() method.
flights = flights.with_columns(
   (pl.col('cost') / pl.col('distance_km')).alias('cost_per_kilometer')
)
flights
shape: (6, 5)
flight cost distance_km non_stop cost_per_kilometer
str f64 i64 bool f64
"Raleigh Durham to Svalbard" null 6209 false null
"Raleigh Durham to London" 1200.0 6216 true 0.19305
"Raleigh Durham to Los Angeles" 600.0 3595 true 0.166898
"Raleigh Durham to Boston" 250.0 985 true 0.253807
"Raleigh Durham to Chicago" 300.0 1040 false 0.288462
"Raleigh Durham to New York" 420.0 693 true 0.606061

5.5 group_by()

group_by() groups rows defined by one or more variables.

  • Note that group_by() requires additional code to use the groupings!
flights.group_by(pl.col('non_stop')) # this creates groups... but no output!
<polars.dataframe.group_by.GroupBy at 0x10aa58830>

5.5 group_by()

How many flights are non-stop (true) and “multi-stop” (false)?

  • group_by() combined with len() (length) will count the rows in each group!
flights.group_by(pl.col('non_stop')).len()
shape: (2, 2)
non_stop len
bool u32
false 2
true 4

5.5 group_by()

If you have multiple aggregations that you wish to perform on each group of data:

  • agg() (or aggregate) allows one or more calculations per grouping
  • note: you must use expression based syntax if you use .agg() after a group_by()
(
flights.group_by(pl.col('non_stop'))
    .agg(
        pl.col('distance_km').mean().alias('average_distance_km')
    )
)
shape: (2, 2)
non_stop average_distance_km
bool f64
true 2872.25
false 3624.5

5.6 over()

I know…

I said I was only going to show you five common “verbs” in Polars…

  • group_by() provides a convenient way to “collapse” data into groups…
  • but, sometimes you want the group calculation in the original DataFrame
  • This is where the over() method comes in handy.

5.6 over()

Let’s calculate the average distance over non-stop and multi-stop flights.

  • We will insert the calculation for each group in the DataFrame.

(
flights
    .with_columns(
        pl.col('distance_km')
        .mean()
        .over(pl.col('non_stop'))
        .alias('average_distance_km')
    )
    .select(pl.col('flight','non_stop','average_distance_km'))
)
shape: (6, 3)
flight non_stop average_distance_km
str bool f64
"Raleigh Durham to Svalbard" false 3624.5
"Raleigh Durham to London" true 2872.25
"Raleigh Durham to Los Angeles" true 2872.25
"Raleigh Durham to Boston" true 2872.25
"Raleigh Durham to Chicago" false 3624.5
"Raleigh Durham to New York" true 2872.25

06.
Polars in practice

Big data and Polars

  • Let’s explore a large dataset!
    • Roughly 1 million rows of FAA air travel data
    • Raleigh Durham (RDU) airport from 2010 to 2024
    • Special thanks to Simon Couch and colleagues for the “anyflights” package used to secure these data (see final slides for details)

Loading our RDU flights data

If you are using VS Code or another Python IDE:

import polars as pl
rdu_flights = pl.read_parquet("rdu_flights.parquet") 

If you are using Google Colab:

import polars as pl
from google.colab import files
files.upload() # choose rdu_flights.parquet
rdu_flights = pl.read_parquet("rdu_flights.parquet") 

NOTE: you will need to specify the path to the data file.

1. Exploring our RDU flights data

  • How many rows and columns are in this DataFrame?
  • rdu_flights.shape
  • rdu_flights.height and rdu_flights.width will also give you the answer
  • The answer: (954988, 45)
    • 954,988 rows/flights
    • 45 columns

1. Exploring our RDU flights data

  • What are the column names? (What is covered?)
    • rdu_flights.columns
  • BONUS: if I wanted to see the column types, how would I do that?
    • rdu_flights.schema

2. Departing flights by year

What is the number of departing flights at RDU airport by year? (Bonus: sort by year)

  • rdu_flights.group_by('year').len()
  • OR
  • rdu_flights.group_by('year').agg(pl.len().alias('flights'))

2. Departing flights by year

rdu_flights.group_by('year').agg(pl.len().alias('flights')).sort('year')
shape: (15, 2)
year flights
i64 u32
2010 48924
2011 42620
2012 44016
2013 47983
2014 38813
2015 34898
2016 104895
2017 71874
2018 121448
2019 128312
2020 67720
2021 44136
2022 54049
2023 59416
2024 45884

3. Departure delays by year

  • Let’s do this one together…
  • Are departure delays increasing or decreasing at RDU each year?

3. Departure delays by year

Are departure delays increasing or decreasing at RDU each year?

(
rdu_flights
    .group_by('year')
    .agg(pl.col('dep_delay')
        .filter(pl.col('dep_delay') > 15)
        .len()
        .alias('delayed_flights')
    )
    .sort('year')
)

3. Departure delays by year

shape: (15, 2)
year delayed_flights
i64 u32
2010 7339
2011 6234
2012 6099
2013 8003
2014 6757
2015 5623
2016 16182
2017 11952
2018 23592
2019 23146
2020 5082
2021 5674
2022 9896
2023 10762
2024 9260

07.
Polars resources

General Resources

  • CDVS - Duke Libraries - askdata@duke.edu
    As always, Duke Libraries Center for Data and Visualization Science (askdata@duke.edu) can assist with questions about data management and data wrangling. Consultations are available by appointment.

  • Polars API
    The Polars Python API is an outstanding resource for the latest syntax. If you are questioning the validity of AI suggestions (and sometimes those suggestions are erroneous!), the API can help resolve questions.

eBook Resources

Duke Libraries subscribe to the O’Reilly for Higher Education Database where you will find Jeroen Janssens and Thijs Nieuwdorp’s Python Polars: The Definitive Guide. This is an excellent way to learn Polars and serves as a compelling reference.

Data Resources

I used FAA flight data pulled using the anyflights API. If you would like to investigate this API further check out:

Couch S (2023). anyflights: Query ‘nycflights13’- Air Travel Data for Given Years and Airports. Code: https://github.com/simonpcouch/anyflights, Context: https://simonpcouch.github.io/anyflights/.

Other polars APIs

As mentioned in the introduction, Polars is written in Rust and has implementations in Julia, R (check out TidyPolars), and JavaScript. If Polars seems compelling and you regularly use one of those languages, please try one of the other implementations!